-
Notifications
You must be signed in to change notification settings - Fork 15.2k
[BOLT] Optimize basic block loops to avoid n^2 loop #156243
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Thank you for submitting a Pull Request (PR) to the LLVM Project! This PR will be automatically labeled and the relevant teams will be notified. If you wish to, you can add reviewers by using the "Reviewers" section on this page. If this is not working for you, it is probably because you do not have write permissions for the repository. In which case you can instead tag reviewers by name in a comment by using If you have received no comments on your PR for a week, you can request a review by "ping"ing the PR by adding a comment “Ping”. The common courtesy "ping" rate is once a week. Please remember that you are asking for valuable time from other developers. If you have further questions, they may be answered by the LLVM GitHub User Guide. You can also ask questions in a comment on this PR, on the LLVM Discord or on the forums. |
@llvm/pr-subscribers-bolt Author: Mark Rousskov (Mark-Simulacrum) ChangesThis improves BOLT runtime when optimizing rustc_driver.so from 15 minutes to 7 minutes (or 49 minutes to 37 minutes of userspace time). Full diff: https://github.com/llvm/llvm-project/pull/156243.diff 1 Files Affected:
diff --git a/bolt/lib/Core/BinaryFunction.cpp b/bolt/lib/Core/BinaryFunction.cpp
index 6cac2d0cca2cb..a86e204cae974 100644
--- a/bolt/lib/Core/BinaryFunction.cpp
+++ b/bolt/lib/Core/BinaryFunction.cpp
@@ -3591,6 +3591,18 @@ void BinaryFunction::fixBranches() {
auto &MIB = BC.MIB;
MCContext *Ctx = BC.Ctx.get();
+ // Caches `FunctionLayout::nextBasicBlock(IgnoreSplits = false)`.
+ // nextBasicBlock uses linear search to find the next block, so the loop
+ // below becomes O(n^2). This avoids that.
+ DenseMap<BinaryBasicBlock *, BinaryBasicBlock *> nextBasicBlock(
+ Layout.block_size());
+ for (size_t i = 0; i + 1 < Layout.block_size(); i++) {
+ auto current = Layout.block_begin() + i;
+ auto next = Layout.block_begin() + i + 1;
+ if (next != Layout.getFragment((*current)->getFragmentNum()).end())
+ nextBasicBlock.insert(std::pair(*current, *next));
+ }
+
for (BinaryBasicBlock *BB : BasicBlocks) {
const MCSymbol *TBB = nullptr;
const MCSymbol *FBB = nullptr;
@@ -3605,7 +3617,7 @@ void BinaryFunction::fixBranches() {
// Basic block that follows the current one in the final layout.
const BinaryBasicBlock *const NextBB =
- Layout.getBasicBlockAfter(BB, /*IgnoreSplits=*/false);
+ nextBasicBlock.lookup_or(BB, nullptr);
if (BB->succ_size() == 1) {
// __builtin_unreachable() could create a conditional branch that
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @Mark-Simulacrum,
Thanks for your patch. Tests appear to hang, however, your current version includes 69ccc39 (now reverted) which caused some issues.
Could you rebase to latest main so we can verify that?
This improves BOLT runtime when optimizing rustc_driver.so from 15 minutes to 7 minutes (49 minutes to 37 minutes of userspace time).
f833920
to
480d48e
Compare
Rebased. Can you clarify how you're running tests, so I can iterate locally as well? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey Mark,
Thanks for rebasing. I can now verify that tests complete.
There are a few more users of getBasicBlockAfter
in loops (the call you cached) that might also benefit from this.
I am not sure if it's worth wrapping this in a helper that returns the cache, possibly parameterized by IgnoreSplits
as well. Then others can re-use.
You can use the ninja check-bolt
target to run the unit / lit tests.
This enables future re-use in other code that calls getBasicBlockAfter in loops, though for now those uses aren't introduced.
I split the cache creation out into a dedicated function. I think in at least the librustc_driver.so case, none of the other usages show up as hot, so it's potentially not worth building the cache upfront for them. |
Thanks for the patch. The time it takes to run BOLT sounds excessive, even 7 minutes is a lot. Could you share BOLT log with |
I ran it on Rust's CI. This is the log for LLVM: LLVM
And here for the Rust compiler's shared library: rustc
|
Thanks for sharing the logs. Most of the time is spent in split functions optimizations, due to While the CDSplit algorithm can produce the best layout, the improvement over the "regular" function split rarely exceeds 0.5%. Before we speedup CDSplit, it's a tradeoff between the build time and the application performance. Were you able to measure improvement by using CDSplit? If not, you can speed up BOLT processing times significantly by disabling it. I also recommend using |
@Mark-Simulacrum, thanks for identifying the bottleneck and the fix! A slightly cleaner change would be to make |
I'd prefer to leave it to you, not comfortable enough editing this code too significantly (both due to C++ and unfamiliarity with the surroundings). Happy to close this PR or have you merge it (I don't have permission to do so) and then build atop. |
It was actually suggested by Amir 😆 (rust-lang/rust#119418). We saw relatively decent max RSS wins with CDsplit. |
Iterator implementation of PR llvm#156243: This improves BOLT runtime when optimizing rustc_driver.so from 15 minutes to 7 minutes (or 49 minutes to 37 minutes of userspace time). Co-authored-by: Mark-Simulacrum <mark.simulacrum@gmail.com>
Nice! How confident are you in RSS wins? This comes as a surprise and I'd like to know what happened once it's confirmed. The improvement in CPU cycles is expected though. |
Very confident, it was across the board and we don't see these kinds of numbers out of the blue. |
It's possible that "warm" code that gets isolated by CDSplit is not touched by your benchmarks. Do you know if in terms of absolute numbers RSS wins are comparable to the size of |
Iterator implementation of PR #156243: This improves BOLT runtime when optimizing rustc_driver.so from 15 minutes to 7 minutes (or 49 minutes to 37 minutes of userspace time). Co-authored-by: Mark-Simulacrum <mark.simulacrum@gmail.com>
The size of |
Thanks for checking. 30 MiB is hard to explain, unless it's a number gathered across multiple processes, e.g. in a case of a parallel build. If it's the number for a single process, it must be some side effect. I find it hard to believe it's a direct impact from a code layout optimization. |
Did you ever try these options: |
Well, hard to say then. Tbh the main issue with BOLT that we have is not RSS or even the BOLT optimization time, but the fact that it doubles the binary size of the optimized artifacts, due to I suppose that this PR can be closed, now that the change has landed in #160407? |
We haven't tried this exact combination, I'll try it, thanks. The issue we had with |
This improves BOLT runtime when optimizing rustc_driver.so from 15 minutes to 7 minutes (or 49 minutes to 37 minutes of userspace time).